Abstract

This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.

1 Introduction 

The ability for a computer to converse in a natural and coherent manner with a human has long been held as one of the primary objectives of artificial intelligence (AI). In this paper we consider the problem of building dialogue agents that have the ability to interact in one-on-one multi-turn conversations on a diverse set of topics. We primarily target unstructured dialogues, where there is no a priori logical representation for the information exchanged during the conversation. This is in contrast to recent systems which focus on structured dialogue tasks, using a slot-filling representation [10, 27, ?].

We observe that in several subfields of AI (computer vision, speech recognition, machine translation) fundamental breakthroughs were achieved in recent years using machine learning methods, more specifically with neural architectures [?]; however, it is worth noting that many of the most successful approaches, in particular convolutional and recurrent neural networks, were known for many years prior. It is therefore reasonable to attribute this progress to three major factors: 1) the public distribution of very large rich datasets [5], 2) the availability of substantial computing power, and 3) the development of new training methods for neural architectures, in particular leveraging unlabeled data. Similar progress has not yet been observed in the development of dialogue systems. We hypothesize that this is due to the lack of sufficiently large datasets, and aim to overcome this barrier by providing a new large corpus for research in multi-turn conversation.

The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs¹ used to receive technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio). The dataset is orders of magnitude larger than structured corpora such as those of the Dialogue State Tracking Challenge [32]. It is on the same scale as recent datasets for solving problems such as question answering and analysis of microblog services, such as Twitter [22, 25, ?, ?], but each conversation in our dataset includes several more turns, as well as longer utterances. Furthermore, because it targets a specific domain, namely technical support, it can be used as a case study for the development of AI agents in targeted applications, in contrast to chatbot agents that often lack a well-defined goal [26].

In addition to the corpus, we present learning architectures suitable for analyzing this dataset, ranging from the simple term frequency-inverse document frequency (TF-IDF) approach, to more sophisticated neural models including a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) architecture. We provide benchmark performance of these algorithms, trained with our new corpus, on the task of selecting the best next response, which can be achieved without requiring any human labeling. The dataset is ready for public release.² The code developed for the empirical results is also available.³

¹ These logs are available from 2004 to 2015 at http://irclogs.ubuntu.com/

2 Related Work 

We briefly review existing dialogue datasets, and some of the more recent learning architectures used for both structured and unstructured dialogues. This is by no means an exhaustive list (due to space constraints), but surveys resources most related to our contribution. A list of datasets discussed is provided in Table 1.

2.1 Dialogue Datasets 

The Switchboard dataset [8], and the Dialogue State Tracking Challenge (DSTC) datasets [32] have been used to train and validate dialogue management systems for interactive information retrieval. The problem is typically formalized as a slot filling task, where agents attempt to predict the goal of a user during the conversation. These datasets have been significant resources for structured dialogues, and have allowed major progress in this field, though they are quite small compared to datasets currently used for training neural architectures.

Recently, a few datasets have been used containing unstructured dialogues extracted from Twitter. Ritter et al. [21] collected 1.3 million conversations; this was extended in [28] to take advantage of longer contexts by using A-B-A triples. Shang et al. [25] used data from a similar Chinese website called Weibo. However, to our knowledge, these datasets have not been made public, and furthermore, the post-reply format of such microblogging services is perhaps not as representative of natural dialogue between humans as the continuous stream of messages in a chat room. In fact, Ritter et al. estimate that only 37% of posts on Twitter are ‘conversational in nature’, and 69% of their collected data contained exchanges of only length 2 [21]. We hypothesize that chat-room style messaging is more closely correlated to human-to-human dialogue than micro-blogging websites, or forum-based sites such as Reddit.

² Note that a new version of the dataset is now available: https://github.com/rkadlec/ubuntu-ranking-dataset-creator. This version makes some adjustments and fixes some bugs from the first version.

³ http://github.com/npow/ubottu

Part of the Ubuntu chat logs have previously been aggregated into a dataset, called the Ubuntu Chat Corpus [30]. However, that resource preserves the multi-participant structure and thus is less amenable to the investigation of more traditional two-party conversations.

Also weakly related to our contribution is the problem of question-answer systems. Several datasets of question-answer pairs are available [3]; however, these interactions are much shorter than what we seek to study.

2.2 Learning Architectures 

Most dialogue research has historically focused on structured slot-filling tasks [?]. Various approaches were proposed, yet few attempts leverage more recent developments in neural learning architectures. A notable exception is the work of Henderson et al. [?], which proposes an RNN structure, initialized with a denoising autoencoder, to tackle the DSTC 3 domain.

Work on unstructured dialogues, recently pioneered by Ritter et al. [22], proposed a response generation model for Twitter data based on ideas from Statistical Machine Translation. This is shown to give superior performance to previous information retrieval (e.g. nearest neighbour) approaches [14]. This idea was further developed by Sordoni et al. [28] to exploit information from a longer context, using a structure similar to the Recurrent Neural Network Encoder-Decoder model [4]. This achieves rather poor performance on A-B-A Twitter triples when measured by the BLEU score (a standard for machine translation), yet performs comparatively better than the model of Ritter et al. [22]. Their results are also verified with a human-subject study. A similar encoder-decoder framework is presented in [25]. This model uses one RNN to transform the input to some vector representation, and another RNN to ‘decode’ this representation to a response by generating one word at a time. This model is also evaluated in a human-subject study, although much smaller in size than in [28]. Overall, these models highlight the potential of neural learning architectures for interactive systems, yet so far they have been limited to very short conversations.

Dataset                    | Type                   | Task                          | # Dialogues | # Utterances | # Words     | Description
---------------------------+------------------------+-------------------------------+-------------+--------------+-------------+------------------------------------------------
Switchboard                | Human-human spoken     | Various                       | 2,400       | -            | 3,000,000   | Telephone conversations on pre-specified topics
DSTC1 [32]                 | Human-computer spoken  | State tracking                | 15,000      | 210,000      | -           | Bus ride information system
DSTC2 [10]                 | Human-computer spoken  | State tracking                | 3,000       | 24,000       | -           | Restaurant booking system
DSTC3 [?]                  | Human-computer spoken  | State tracking                | 2,265       | 15,000       | -           | Tourist information system
DSTC4 [13]                 | Human-human spoken     | State tracking                | 35          | -            | -           | 21 hours of tourist info exchange over Skype
Twitter Corpus [21]        | Human-human micro-blog | Next utterance generation     | 1,300,000   | 3,000,000    | -           | Post/replies extracted from Twitter
Twitter Triple Corpus [28] | Human-human micro-blog | Next utterance generation     | 29,000,000  | 87,000,000   | -           | A-B-A triples from Twitter replies
Sina Weibo [25]            | Human-human micro-blog | Next utterance generation     | 4,435,959   | 8,871,918    | -           | Post/reply pairs extracted from Weibo
Ubuntu Dialogue Corpus     | Human-human chat       | Next utterance classification | 930,000     | 7,100,000    | 100,000,000 | Extracted from Ubuntu Chat Logs

Table 1: A selection of structured and unstructured large-scale datasets applicable to dialogue systems. Faded datasets are not publicly available. The last entry is our contribution.

3 The Ubuntu Dialogue Corpus 

We seek a large dataset for research in dialogue 
systems with the following properties: 

• Two-way (or dyadic) conversation, as opposed to multi-participant chat, preferably human-human.

• Large number of conversations; $10^5$ to $10^6$ is typical of datasets used for neural-network learning in other areas of AI.

• Many conversations with several turns (more 
than 3). 

• Task-specific domain, as opposed to chatbot systems.

All of these requirements are satisfied by the Ubuntu Dialogue Corpus presented in this paper.

3.1 Ubuntu Chat Logs 

The Ubuntu Chat Logs refer to a collection of logs 
from Ubuntu-related chat rooms on the Freenode 
Internet Relay Chat (IRC) network. This protocol 
allows for real-time chat between a large number 
of participants. Each chat room, or channel, has 
a particular topic, and every channel participant 
can see all the messages posted in a given channel. Many of these channels are used for obtaining technical support with various Ubuntu issues.

As the contents of each channel are moderated, most interactions follow a similar pattern. A new user joins the channel, and asks a general question about a problem they are having with Ubuntu. Then, another more experienced user replies with a potential solution, after first addressing the ‘username’ of the first user. This is called a name mention [29], and is done to avoid confusion in the channel: at any given time during the day, there can be between 1 and 20 simultaneous conversations happening in some channels. In the most popular channels, there is almost never a time when only one conversation is occurring; this renders it particularly problematic to extract dyadic dialogues. A conversation between a pair of users generally stops when the problem has been solved, though some users occasionally continue to discuss a topic not related to Ubuntu.

Despite the nature of the chat room being a constant stream of messages from multiple users, it is through the fairly rigid structure in the messages that we can extract the dialogues between users. Figure [?] shows an example chat room conversation from the #ubuntu channel as well as the extracted dialogues, which illustrates how users usually state the username of the intended message recipient before writing their reply (we refer to all replies and initial questions as ‘utterances’). For example, it is clear that users ‘Taru’ and ‘kuja’ are engaged in a dialogue, as are users ‘Old’ and ‘bur[n]er’, while user ‘_pm’ is asking an initial question, and ‘LiveCD’ is perhaps elaborating on a previous comment.

3.2 Dataset Creation 

In order to create the Ubuntu Dialogue Corpus, first a method had to be devised to extract dyadic dialogues from the chat room multi-party conversations. The first step was to separate every message into 4-tuples of (time, sender, recipient, utterance). Given these 4-tuples, it is straightforward to group all tuples where there is a matching sender and recipient. Although it is easy to separate the time and the sender from the rest, finding the intended recipient of the message is not always trivial.
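The grouping step can be illustrated with a minimal sketch, assuming each message has already been parsed into a (time, sender, recipient, utterance) tuple. The function name `group_by_pair`, the sample messages, and the use of `None` for an empty recipient are illustrative assumptions, not part of the released code.

```python
from collections import defaultdict


def group_by_pair(messages):
    """Bucket (time, sender, recipient, utterance) tuples by participant pair.

    Messages with no recipient (None) are skipped here; they correspond to
    initial questions, which are matched to a dialogue in a later step.
    """
    groups = defaultdict(list)
    for time, sender, recipient, utterance in messages:
        if recipient is None:
            continue
        # A frozenset key puts (A -> B) and (B -> A) in the same bucket.
        groups[frozenset((sender, recipient))].append((time, sender, utterance))
    return dict(groups)


# Hypothetical messages, loosely modeled on the usernames in the text.
messages = [
    ("10:01", "Taru", None, "my wifi card is not detected"),
    ("10:02", "kuja", "Taru", "which card is it?"),
    ("10:03", "Taru", "kuja", "an intel one"),
    ("10:03", "Old", "bur[n]er", "did you try rebooting?"),
]
groups = group_by_pair(messages)
```

Keying on the unordered pair keeps both directions of a conversation together, which is what makes the later extraction of dyadic dialogues straightforward.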

3.2.1 Recipient Identification 

While in most cases the recipient is the first word of the utterance, it is sometimes located at the end, or not at all in the case of initial questions. Furthermore, some users choose names corresponding to common English words, such as ‘the’ or ‘stop’, which could lead to many false positives. In order to solve this issue, we create a dictionary of usernames from the current and previous days, and compare the first word of each utterance to its entries. If a match is found, and the word does not correspond to a very common English word,⁶ it is assumed that this user was the intended recipient of the message. If no matches are found, it is assumed that the message was an initial question, and the recipient value is left empty.
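The recipient heuristic described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `COMMON_WORDS` is a tiny hypothetical stand-in for the spell-checking dictionary, and stripping a trailing `:` or `,` from the first word is an added assumption about how users write name mentions.

```python
from typing import Optional, Set

# Hypothetical stand-in for a list of very common English words.
COMMON_WORDS = {"the", "stop", "hello", "yes", "no"}


def find_recipient(utterance: str, known_users: Set[str],
                   common_words: Set[str] = COMMON_WORDS) -> Optional[str]:
    """Return the addressed username, or None for an initial question."""
    words = utterance.split()
    if not words:
        return None
    # Assumption: name mentions are often written as "name:" or "name,".
    first = words[0].rstrip(":,")
    if first in known_users and first.lower() not in common_words:
        return first
    return None
```

For example, `find_recipient("kuja: try sudo apt-get update", {"kuja", "the"})` returns `"kuja"`, while an utterance beginning with "the" is treated as an initial question even though a user named "the" exists.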

3.2.2 Utterance Creation 

The dialogue extraction algorithm works backwards from the first response to find the initial question that was replied to, within a time frame of 3 minutes. A first response is identified by the presence of a recipient name (someone from the recent conversation history). The initial question is identified to be the most recent utterance by the recipient identified in the first response.

All utterances that do not qualify as a first response or an initial question are discarded; initial questions that do not generate any response are also discarded. We additionally discard conversations longer than five utterances where one user says more than 80% of the utterances, as these are typically not representative of real chat dialogues. Finally, we consider only extracted dialogues that consist of 3 turns or more to encourage the modeling of longer-term dependencies.

To alleviate the problem of ‘holes’ in the dialogue, where one user does not address the other explicitly, as in Figure [?], we check whether each user talks to someone else for the duration of their conversation. If not, all non-addressed utterances are added to the dialogue. An example conversation along with the extracted dialogues is shown in Figure [?]. Note that we also concatenate all consecutive utterances from a given user.
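The filtering and concatenation rules can be sketched as follows. The order of operations is an assumption: the text does not say whether the 80% rule is applied before or after consecutive utterances are merged, so this sketch applies it to the raw utterances and applies the 3-turn minimum to the merged turns. The names `merge_consecutive` and `clean_dialogue` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def merge_consecutive(dialogue):
    """Concatenate consecutive utterances from the same sender into one turn."""
    merged = []
    for sender, group in groupby(dialogue, key=lambda u: u[0]):
        merged.append((sender, " ".join(text for _, text in group)))
    return merged


def clean_dialogue(dialogue):
    """Return the merged dialogue, or None if it should be discarded.

    dialogue is a list of (sender, text) pairs in chronological order.
    """
    # Discard dialogues longer than five utterances that one user dominates.
    if len(dialogue) > 5:
        top_count = Counter(sender for sender, _ in dialogue).most_common(1)[0][1]
        if top_count / len(dialogue) > 0.8:
            return None
    merged = merge_consecutive(dialogue)
    # Keep only dialogues with at least 3 turns.
    return merged if len(merged) >= 3 else None
```

Merging after the dominance check matters here: once consecutive utterances are concatenated, a two-person dialogue alternates strictly, so the 80% rule could never trigger on merged turns.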

⁶ We use the GNU Aspell spell checking dictionary.



Figure 1: Plot of number of conversations with a given number of turns. Both axes use a log scale.


# dialogues (human-human)          | 930,000
# utterances (in total)            | 7,100,000
# words (in total)                 | 100,000,000
Min. # turns per dialogue          | 3
Avg. # turns per dialogue          | 7.71
Avg. # words per utterance         | 10.34
Median conversation length (min)   | 6

Table 2: Properties of Ubuntu Dialogue Corpus.


We do not apply any further pre-processing (e.g. tokenization, stemming) to the data as released in the Ubuntu Dialogue Corpus. However, the use of pre-processing is standard for most NLP systems, and was also used in our analysis (see Section 4).

3.2.3 Special Cases and Limitations 

It is often the case that a user will post an initial question, and multiple people will respond to it with different answers. In this instance, each conversation between the first user and the user who replied is treated as a separate dialogue. This has the unfortunate side-effect of having the initial question appear multiple times in several dialogues. However, the number of such cases is sufficiently small compared to the size of the dataset.

Another issue to note is that the utterance posting time is not considered for segmenting conversations between two users. Even if two users have a conversation that spans multiple hours, or even days, this is treated as a single dialogue. However, such dialogues are rare. We include the posting time in the corpus so that other researchers may filter as desired.

3.3 Dataset Statistics 

Table 2 summarizes properties of the Ubuntu Dialogue Corpus. One of the most important features of the Ubuntu chat logs is its size. This is crucial for research into building dialogue managers based on neural architectures. Another important characteristic is the number of turns in these dialogues. The distribution of the number of turns is shown in Figure 1. It can be seen that the number of dialogues and turns per dialogue follow an approximate power law relationship.

3.4 Test Set Generation 

We set aside 2% of the Ubuntu Dialogue Corpus conversations (randomly selected) to form a test set that can be used for evaluation of response selection algorithms. Compared to the rest of the corpus, this test set has been further processed to extract a pair of (context, response, flag) triples from each dialogue. The flag is a Boolean variable indicating whether or not the response was the actual next utterance after the given context. The response is a target (output) utterance which we aim to correctly identify. The context consists of the sequence of utterances appearing in dialogue prior to the response. We create a pair of triples, where one triple contains the correct response (i.e. the actual next utterance in the dialogue), and the other triple contains a false response, sampled randomly from elsewhere within the test set. The flag is set to 1 in the first case and to 0 in the second case. An example pair is shown in Table 3. To make the task harder, we can move from pairs of responses (one correct, one incorrect) to a larger set of wrong responses (all with flag=0). In our experiments below, we consider both the case of 1 wrong response (1 in 2) and 9 wrong responses (1 in 10).


Context                                              | Response                                                | Flag
-----------------------------------------------------+---------------------------------------------------------+-----
well, can I move the drives? _EOS_ ah not like that  | I guess I could just get an enclosure and copy via USB  | 1
well, can I move the drives? _EOS_ ah not like that  | you can use "ps ax" and "kill (PID #)"                  | 0

Table 3: Test set example with (context, reply, flag) format. The ‘_EOS_’ tag is used to denote the end of an utterance within the context.

Since we want to learn to predict all parts of a conversation, as opposed to only the closing statement, we consider various portions of context for the conversations in the test set. The context size is determined stochastically using a simple formula:

$$c = \min(t - 1,\; n - 1), \quad \text{where } n = \frac{10C}{\eta} + 2, \quad \eta \sim \mathrm{Unif}(C/2,\, 10C).$$

Here, $C$ denotes the maximum desired context size, which we set to $C = 20$. The last term is the desired minimum context size, which we set to be 2. Parameter $t$ is the actual length of that dialogue (thus the constraint that $c \le t - 1$), and $n$ is a random number corresponding to the randomly sampled context length, that is selected to be inversely proportional to $C$.

In practice, this leads to short test dialogues having short contexts, while longer dialogues are often broken into short or medium-length segments, with the occasional long context of 10 or more turns.
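A small sketch may help make the sampling concrete. It assumes the formula reads c = min(t − 1, n − 1) with n = 10C/η + 2 and η drawn uniformly from (C/2, 10C); truncating n to an integer is a further assumption, since the text does not say how the real-valued quantity is rounded. The function name `sample_context_size` is hypothetical.

```python
import random


def sample_context_size(t, C=20, rng=random):
    """Sample c = min(t - 1, n - 1), with n = 10C/eta + 2, eta ~ Unif(C/2, 10C)."""
    eta = rng.uniform(C / 2, 10 * C)
    # Assumption: truncate the real-valued term to an integer.
    n = int(10 * C / eta) + 2
    return min(t - 1, n - 1)


rng = random.Random(0)
sizes = [sample_context_size(100, rng=rng) for _ in range(1000)]
```

Because 10C/η is large only when η is small, most samples give short contexts and long contexts are comparatively rare, which matches the behaviour described in the text. The "+ 2" term guarantees a context of at least 2 utterances whenever the dialogue has 3 or more turns.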

3.5 Evaluation Metric 

We consider the task of best response selection. This can be achieved by processing the data as described in Section 3.4, without requiring any human labels. This classification task is an adaptation of the recall and precision metrics previously applied to dialogue datasets [?].

A family of metrics often used in language tasks is Recall@k (denoted R@1, R@2, R@5 below). Here the agent is asked to select the k most likely responses, and it is correct if the true response is among these k candidates. Only the R@1 metric is relevant in the case of binary classification (as in the Table 3 example).
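Recall@k can be computed directly from model scores. The sketch below assumes, purely for illustration, that each test example is a list of candidate scores with the true response stored at index 0; the function name `recall_at_k` and that layout are not taken from the paper.

```python
def recall_at_k(score_lists, k):
    """Fraction of examples whose true response (index 0) ranks in the top k."""
    hits = 0
    for scores in score_lists:
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if 0 in ranked[:k]:
            hits += 1
    return hits / len(score_lists)


# Two hypothetical examples with three candidates each.
examples = [
    [0.9, 0.1, 0.5],  # true response scored highest
    [0.2, 0.8, 0.5],  # true response scored lowest
]
```

With two candidates per example, `recall_at_k(..., 1)` reduces to binary classification accuracy, which is why only R@1 is meaningful in the 1 in 2 setting.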

Although a language model that performs well on response classification is not a gauge of good performance on next utterance generation, we hypothesize that improvements on a model with regards to the classification task will eventually lead to improvements for the generation task. See Section 6 for further discussion of this point.

4 Learning Architectures for Unstructured Dialogues

To provide further evidence of the value of our dataset for research into neural architectures for dialogue managers, we provide performance benchmarks for two neural learning algorithms, as well as one naive baseline. The approaches considered are: TF-IDF, Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM). Prior to applying each method, we perform standard pre-processing of the data using the NLTK⁷ library and Twitter tokenizer⁸ to parse each utterance. We use generic tags for various word categories, such as names, locations, organizations, URLs, and system paths.

⁷ www.nltk.org/

⁸ http://www.ark.cs.cmu.edu/TweetNLP/

To train the RNN and LSTM architectures, we process the full training Ubuntu Dialogue Corpus into the same format as the test set described in Section 3.4, extracting (context, response, flag) triples from dialogues. For the training set, we do not sample the context length, but instead consider each utterance (starting at the 3rd one) as a potential response, with the previous utterances as its context. So a dialogue of length 10 yields 8 training examples. Since these are overlapping, they are clearly not independent, but we consider this a minor issue given the size of the dataset (we further alleviate the issue by shuffling the training examples). Negative responses are selected at random from the rest of the training data.
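The construction of training triples can be sketched as below. The sketch assumes utterances within a context are joined with an `_EOS_` tag surrounded by spaces and that one negative is drawn per positive (the 1:1 ratio used for training); the function name `training_triples` and the `pool` of negative candidates are hypothetical. It does not guard against the sampled negative coinciding with the true response.

```python
import random

EOS = " _EOS_ "  # assumed separator between utterances in a context


def training_triples(dialogue, pool, rng):
    """Build (context, response, flag) triples from one dialogue.

    Every utterance from the 3rd onward is a positive response whose context
    is all preceding utterances; each positive is paired with one negative
    drawn at random from pool.
    """
    triples = []
    for i in range(2, len(dialogue)):
        context = EOS.join(dialogue[:i])
        triples.append((context, dialogue[i], 1))
        triples.append((context, rng.choice(pool), 0))
    return triples


dialogue = [f"utt{i}" for i in range(10)]
triples = training_triples(dialogue, ["neg a", "neg b"], random.Random(0))
```

A dialogue of length 10 produces 8 positive triples here, matching the count given in the text, plus 8 negatives.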

4.1 TF-IDF 

Term frequency-inverse document frequency is a statistic that intends to capture how important a given word is to some document, which in our case is the context [20]. It is a technique often used in document classification and information retrieval. The ‘term-frequency’ term is simply a count of the number of times a word appears in a given context, while the ‘inverse document frequency’ term puts a penalty on how often the word appears elsewhere in the corpus. The final score is calculated as the product of these two terms, and has the form:

$$\mathrm{tfidf}(w, d, D) = f(w, d) \times \log \frac{N}{|\{d \in D : w \in d\}|},$$

where $f(w, d)$ indicates the number of times word $w$ appeared in context $d$, $N$ is the total number of dialogues, and the denominator represents the number of dialogues in which the word $w$ appears.

For classification, the TF-IDF vectors are first calculated for the context and each of the candidate responses. Given a set of candidate response vectors, the one with the highest cosine similarity to the context vector is selected as the output. For Recall@k, the top k responses are returned.
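The TF-IDF baseline can be sketched in plain Python. This is a minimal illustration, not the authors' code: whitespace tokenization, sparse dictionary vectors, and assigning an IDF of 0 to words unseen in the training dialogues are all simplifying assumptions, and the toy corpus is invented.

```python
import math
from collections import Counter


def idf_table(documents):
    """IDF per word: log(N / number of documents containing the word)."""
    n = len(documents)
    df = Counter(w for doc in documents for w in set(doc.split()))
    return {w: math.log(n / count) for w, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; unseen words get weight 0 (an assumption)."""
    tf = Counter(text.split())
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}


def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank_responses(context, candidates, idf):
    """Candidates sorted by cosine similarity to the context, best first."""
    c_vec = tfidf_vector(context, idf)
    return sorted(candidates,
                  key=lambda r: cosine(c_vec, tfidf_vector(r, idf)),
                  reverse=True)


corpus = [
    "how do i install apache",
    "apache will not start",
    "my wifi card is not detected",
    "which wifi driver do i need",
]
idf = idf_table(corpus)
ranked = rank_responses(
    "apache will not start after install",
    ["check your wifi driver", "try reinstall apache package"],
    idf,
)
```

Taking the first k entries of `ranked` gives the candidate set used for Recall@k. The example also shows the baseline's weakness: a candidate is preferred only because it shares a weighted word ("apache") with the context.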

4.2 RNN 

Recurrent neural networks are a variant of neural networks that allows for time-delayed directed cycles between units [?]. This leads to the formation of an internal state of the network, $h_t$, which allows it to model time-dependent data. The internal state is updated at each time step as some function of the observed variables $x_t$, and the hidden state at the previous time step $h_{t-1}$. $W_x$ and $W_h$ are matrices associated with the input and hidden state.

$$h_t = f(W_h h_{t-1} + W_x x_t).$$

Figure 2: Diagram of our model. The RNNs have tied weights. $c$, $r$ are the last hidden states from the RNNs. $c_i$, $r_i$ are word vectors for the context and response, $i < t$. We consider contexts up to a maximum of $t = 160$.

A diagram of an RNN can be seen in Figure 2. RNNs have been the primary building block of many current neural language models [22, 28], which use RNNs for an encoder and decoder. The first RNN is used to encode the given context, and the second RNN generates a response by using beam-search, where its initial hidden state is biased using the final hidden state from the first RNN. In our work, we are concerned with classification of responses, instead of generation. We build upon the approach in [2], which has also been recently applied to the problem of question answering [?].

We utilize a Siamese network consisting of two RNNs with tied weights to produce the embeddings for the context and response. Given some input context and response, we compute their embeddings, $c, r \in \mathbb{R}^d$ respectively, by feeding the word embeddings one at a time into its respective RNN. Word embeddings are initialized using the pre-trained vectors (Common Crawl, 840B tokens from [19]), and fine-tuned during training. The hidden state of the RNN is updated at each step, and the final hidden state represents a summary of the input utterance. Using the final hidden states from both RNNs, we then calculate the probability that this is a valid pair:

$$p(\mathrm{flag} = 1 \mid c, r, M) = \sigma(c^\top M r + b),$$

where the bias $b$ and the matrix $M \in \mathbb{R}^{d \times d}$ are learned model parameters. This can be thought of as a generative approach; given some input response, we generate a context with the product $c' = Mr$, and measure the similarity to the actual context using the dot product. This is converted to a probability with the sigmoid function. The model is trained by minimizing the cross entropy of all labeled (context, response) pairs [?]:

$$\mathcal{L} = -\sum_n \log p(\mathrm{flag}_n \mid c_n, r_n, M) + \frac{\lambda}{2} \|\theta\|_F^2,$$

where $\|\theta\|_F$ is the Frobenius norm of $\theta = \{M, b\}$. In our experiments, we use $\lambda = 0$ for computational simplicity.
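To make the scoring function concrete, here is a dependency-free sketch of the encoder and the pair probability. It assumes the recurrence h_t = f(W_h h_{t-1} + W_x x_t) with f = tanh (the text only says "some function"), the score σ(cᵀMr + b), and the unregularized cross entropy (λ = 0). All function names are hypothetical, and no training loop is shown.

```python
import math


def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rnn_encode(word_vectors, W_h, W_x):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t); return the final hidden state."""
    h = [0.0] * len(W_h)
    for x in word_vectors:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_x, x))]
    return h


def pair_probability(c, r, M, b):
    """p(flag = 1 | c, r, M) = sigmoid(c^T M r + b)."""
    score = sum(ci * mi for ci, mi in zip(c, matvec(M, r))) + b
    return 1.0 / (1.0 + math.exp(-score))


def cross_entropy(examples, M, b):
    """Summed negative log-likelihood over (c, r, flag) examples, lambda = 0."""
    loss = 0.0
    for c, r, flag in examples:
        p = pair_probability(c, r, M, b)
        loss -= math.log(p if flag == 1 else 1.0 - p)
    return loss
```

In a Siamese setup the same `W_h` and `W_x` would be passed to `rnn_encode` for both the context and the response, which is what "tied weights" means; `M` then maps the response embedding into the context space before the dot product.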

For training, we used a 1:1 ratio between true responses (flag=1), and negative responses (flag=0) drawn randomly from elsewhere in the training set. The RNN architecture is set to 1 hidden layer with 50 neurons. The $W_h$ matrix is initialized using orthogonal weights [23], while $W_x$ is initialized using a uniform distribution with values between -0.01 and 0.01. We use Adam as our optimizer [?], with gradients clipped to 10. We found that weight initialization as well as the choice of optimizer were critical for training the RNNs.

4.3 LSTM 

In addition to the RNN model, we consider the same architecture but changed the hidden units to long-short term memory (LSTM) units [12]. LSTMs were introduced in order to model longer-term dependencies. This is accomplished using a series of gates that determine whether a new input should be remembered, forgotten (and the old value retained), or used as output. The error signal can now be fed back indefinitely into the gates of the LSTM unit. This helps overcome the vanishing and exploding gradient problems in standard RNNs, where the error gradients would otherwise decrease or increase at an exponential rate. In training, we used 1 hidden layer with 200 neurons. The hyper-parameter configuration (including number of neurons) was optimized independently for RNNs and LSTMs using a validation set extracted from the training data.
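The gating mechanism can be illustrated with a single-unit LSTM step using the standard input, forget, and output gates. This is a scalar sketch for intuition only: the text does not specify which LSTM variant was used, and the parameter names (`wi`, `ui`, `bi`, and so on) are hypothetical.

```python
import math

PARAM_NAMES = ("wi", "ui", "bi", "wf", "uf", "bf",
               "wo", "uo", "bo", "wg", "ug", "bg")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; returns the new (h, c)."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g    # keep part of the old memory, add new input
    h = o * math.tanh(c)      # expose part of the memory as output
    return h, c


zero = {name: 0.0 for name in PARAM_NAMES}
# A strongly positive forget bias and negative input bias retain the memory.
keep = dict(zero, bf=20.0, bi=-20.0)
```

With the `keep` parameters the cell state passes through a step almost unchanged, which is the property that lets error signals propagate over many time steps without vanishing.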

5 Empirical Results 

The results for the TF-IDF, RNN, and LSTM models are shown in Table 4. The models were evaluated using both 1 (1 in 2) and 9 (1 in 10) false examples. Of course, the Recall@2 and Recall@5 are not relevant in the binary classification case.⁹

Method        | TF-IDF | RNN   | LSTM
--------------+--------+-------+------
1 in 2 R@1    | 65.9%  | 76.8% | 87.8%
1 in 10 R@1   | 41.0%  | 40.3% | 60.4%
1 in 10 R@2   | 54.5%  | 54.7% | 74.5%
1 in 10 R@5   | 70.8%  | 81.9% | 92.6%

Table 4: Results for the three algorithms using various recall measures for binary (1 in 2) and 1 in 10 next utterance classification (%).


We observe that the LSTM outperforms both the RNN and TF-IDF on all evaluation metrics. It is interesting to note that TF-IDF actually outperforms the RNN on the Recall@1 case for the 1 in 10 classification. This is most likely due to the limited ability of the RNN to take into account long contexts, which can be overcome by using the LSTM. An example output of the LSTM where the response is correctly classified is shown in Table 5. We also show, in Figure 3, the increase in performance of the LSTM as the amount of data used for training increases. This confirms the importance of having a large training set.

Context
"any apache hax around ? i just deleted all of _path_ - which package provides it ?",
"reconfiguring apache do n’t solve it ?"

Ranked Responses                               | Flag
-----------------------------------------------+-----
1. "does n’t seem to, no"                      | 1
2. "you can log in but not transfer files ?"   | 0

Table 5: Example showing the ranked responses from the LSTM. Each utterance is shown after pre-processing steps.

6 Discussion 

This paper presents the Ubuntu Dialogue Corpus, a large dataset for research in unstructured multi-turn dialogue systems. We describe the construction of the dataset and its properties. The availability of a dataset of this size opens up several interesting possibilities for research into dialogue systems based on rich neural-network architectures. We present preliminary results demonstrating use of this dataset to train an RNN and an LSTM for the task of selecting the next best response in a conversation; we obtain significantly better results with the LSTM architecture. There are several interesting directions for future work.

Figure 3: The LSTM (with 200 hidden units), showing Recall@1 for the 1 in 10 classification, with increasing dataset sizes.

⁹ Note that these results are on the original dataset. Results on the new dataset should not be compared to the old dataset; baselines on the new dataset will be released shortly.

6.1 Conversation Disentanglement 

Our approach to conversation disentanglement consists of a small set of rules. More sophisticated techniques have been proposed, such as training a maximum-entropy classifier to cluster utterances into separate dialogues [16]. However, since we are not trying to replicate the exact conversation between two users, but only to retrieve plausible natural dialogues, the heuristic method presented in this paper may be sufficient. This seems supported through qualitative examination of the data, but could be the subject of more formal evaluation.

6.2 Altering Test Set Difficulty 

One of the interesting properties of the response selection task is the ability to alter the task difficulty in a controlled manner. We demonstrated this by moving from 1 to 9 false responses, and by varying the Recall@k parameter. In the future, instead of choosing false responses randomly, we will consider selecting false responses that are similar to the actual response (e.g. as measured by cosine similarity). A dialogue model that performs well on this more difficult task should also manage to capture a more fine-grained semantic meaning of sentences, as compared to a model that naively picks replies with the most words in common with the context such as TF-IDF.


6.3 State Tracking and Utterance Generation 

The work described here focuses on the task of response selection. This can be seen as an intermediate step between slot filling and utterance generation. In slot filling, the set of candidate outputs (states) is identified a priori through knowledge engineering, and is typically smaller than the set of responses considered in our work. When the set of candidate responses is close to the size of the dataset (e.g. all utterances ever recorded), then we are quite close to the response generation case.

There are several reasons not to proceed directly 
to response generation. First, it is likely that current 
algorithms are not yet able to generate good 
results for this task, and it is preferable to tackle 
metrics for which we can make progress. Second, 
we do not yet have a suitable metric for evaluating 
performance in the response generation case. 
One option is to use the BLEU ifTSl or METEOR 
HI scores from machine translation. However, 
using BLEU to evaluate dialogue systems has been 
shown to give extremely low scores 1^ . due to 
the large space of potential sensible responses fT|. 
Further, since the BLEU score is calculated using 
N-grams IfTSl, it would provide a very low 
score for reasonable responses that do not have 
any words in common with the ground-truth next 
utterance. 
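
The N-gram issue can be made concrete with a small sketch. The code 
below computes only the clipped n-gram precision that BLEU is built 
from; it omits the brevity penalty and the combination over n-gram 
orders, so it is not a full BLEU implementation. The example sentences 
are invented for illustration. 

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Invented examples: a sensible paraphrase and a near copy of the reference.
reference = "try restarting the network manager"
paraphrase = "reboot your wifi service"
near_copy = "try restarting the network service"

print(ngram_precision(paraphrase, reference, 1))  # 0.0
print(ngram_precision(near_copy, reference, 1))   # 0.8
```

The paraphrase is a plausible reply yet shares no unigram with the 
reference, so any score built on these precisions without smoothing 
is zero for it. 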

Alternatively, one could measure the difference 
between the generated utterance and the actual 
sentence by comparing their representations in 
some embedding (or semantic) space. However, 
different models inevitably use different embeddings, 
necessitating a standardized embedding for 
evaluation purposes. Such a standardized embedding 
has yet to be created. 
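
A minimal sketch of such an embedding-based comparison follows. The 
three-dimensional word vectors are hypothetical values invented for 
illustration, and averaging word vectors is only one assumed way of 
representing a sentence; neither is proposed here as the standardized 
embedding. 

```python
import math

# Hypothetical word vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "restart": [0.9, 0.1, 0.0],
    "reboot": [0.8, 0.2, 0.1],
    "network": [0.1, 0.9, 0.2],
    "wifi": [0.2, 0.8, 0.3],
    "thanks": [0.0, 0.1, 0.9],
}


def sentence_vector(text, embeddings):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_similarity(generated, actual, embeddings=TOY_EMBEDDINGS):
    """Similarity of a generated utterance to the actual one in embedding space."""
    g = sentence_vector(generated, embeddings)
    a = sentence_vector(actual, embeddings)
    if g is None or a is None:
        return 0.0
    return cosine(g, a)


print(embedding_similarity("reboot wifi", "restart network"))
print(embedding_similarity("thanks", "restart network"))
```

With these toy vectors, the first pair scores higher than the second 
even though neither generated reply shares a word with the actual one. 
The scores depend entirely on the chosen vectors, which is why a 
shared embedding would be needed for results to be comparable. 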

Another possibility is to use human subjects to 
score automatically generated responses, but time 
and expense make this a highly impractical option. 

In summary, while it is possible that current language 
models have outgrown the use of slot filling 
as a metric, we are currently unable to measure 
their ability in next utterance generation in 
a standardized, meaningful and inexpensive way. 
This motivates our choice of response selection as 
a useful metric for the time being. 

Appendix A: Dialogue excerpts 



Time   User      Utterance 
03:44  Old       I dont run graphical ubuntu, I run ubuntu server. 
03:45  kuja      Taru: Haha sucker. 
03:45  Taru      Kuja: ? 
03:45  bur[n]er  Old: you can use "ps ax" and "kill (PID#)" 
03:45  kuja      Taru: Anyways, you made the changes right? 
03:45  Taru      Kuja: Yes. 
03:45  LiveCD    or killall speedlink 
03:45  kuja      Taru: Then from the terminal type: sudo apt-get update 
03:46  _pm       if i install the beta version, how can i update it when the final version comes out? 
03:46  Taru      Kuja: I did. 

Sender    Recipient  Utterance 
Old                  I dont run graphical ubuntu, I run ubuntu server. 
bur[n]er  Old        you can use "ps ax" and "kill (PID#)" 
kuja      Taru       Haha sucker. 
Taru      Kuja       ? 
kuja      Taru       Anyways, you made the changes right? 
Taru      Kuja       Yes. 
kuja      Taru       Then from the terminal type: sudo apt-get update 
Taru      Kuja       I did. 


Figure 4: Example chat room conversation from 
the #ubuntu channel of the Ubuntu Chat Logs 
(top), with the disentangled conversations for the 
Ubuntu Dialogue Corpus (bottom). 


Time     User   Utterance 
[12:21]  dell   well, can I move the drives? 
[12:21]  cucho  dell: ah not like that 
[12:21]  RC     dell: you can’t move the drives 
[12:21]  RC     dell: definitely not 
[12:21]  dell   ok 
[12:21]  dell   lol 
[12:21]  RC     this is the problem with RAID:) 
[12:21]  dell   RC haha yeah 
[12:22]  dell   cucho, I guess I could just get an enclosure and copy via USB... 
[12:22]  cucho  dell: i would advise you to get the disk 

Sender  Recipient  Utterance 
dell               well, can I move the drives? 
cucho   dell       ah not like that 
dell    cucho      I guess I could just get an enclosure and copy via USB 
cucho   dell       i would advise you to get the disk 

dell               well, can I move the drives? 
RC      dell       you can’t move the drives, definitely not. this is the problem with RAID :) 
dell    RC         haha yeah 


Figure 5: Example of before (top box) and after 
(bottom box) the algorithm adds and concatenates 
utterances in dialogue extraction. Since RC only 
addresses dell, all of his utterances are added; 
however, this is not done for dell, as he addresses 
both RC and cucho. 
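
The behaviour shown in this figure can be approximated by the sketch 
below. It is a simplified reading of the figure, not the extraction 
code used to build the corpus. The helper names (split_recipient, 
extract_dialogue) are hypothetical. Two rules are assumptions: a 
recipient is detected when the first word of an utterance matches a 
user name seen in the log, and a dialogue opens with the most recent 
unaddressed utterance of the user who is answered first. 

```python
def split_recipient(utterance, users):
    """Return (recipient, text) if the first word names a known user (assumed rule)."""
    first, _, rest = utterance.partition(" ")
    lookup = {u.lower(): u for u in users}
    name = first.rstrip(":,").lower()
    if name in lookup and rest:
        return lookup[name], rest.strip()
    return None, utterance


def extract_dialogue(log, a, b):
    """Extract the dialogue between users a and b from (sender, utterance) pairs."""
    users = {sender for sender, _ in log}
    parsed = [(s,) + split_recipient(u, users) for s, u in log]

    # Everyone each user ever addresses by name.
    addressed = {u: set() for u in users}
    for sender, recipient, _ in parsed:
        if recipient is not None:
            addressed[sender].add(recipient)

    # First utterance exchanged directly between the two users.
    first = next((i for i, (s, r, _) in enumerate(parsed) if {s, r} == {a, b}), None)
    if first is None:
        return []
    # Assumed rule: open with the asker's most recent unaddressed utterance.
    asker = parsed[first][1]
    start = max((i for i in range(first)
                 if parsed[i][0] == asker and parsed[i][1] is None), default=first)

    turns = []
    for i in range(start, len(parsed)):
        sender, recipient, text = parsed[i]
        direct = {sender, recipient} == {a, b}
        partner = b if sender == a else a
        # Unaddressed lines are added only for a user who addresses nobody but the partner.
        implied = (sender in (a, b) and recipient is None
                   and addressed[sender] == {partner})
        if not (direct or implied or i == start):
            continue
        if turns and turns[-1][0] == sender:
            # Concatenate consecutive utterances from the same sender.
            turns[-1] = (sender, turns[-1][1], turns[-1][2] + " " + text)
        else:
            turns.append((sender, recipient, text))
    return turns


# The chat log from the top box of the figure.
log = [
    ("dell", "well, can I move the drives?"),
    ("cucho", "dell: ah not like that"),
    ("RC", "dell: you can't move the drives"),
    ("RC", "dell: definitely not"),
    ("dell", "ok"),
    ("dell", "lol"),
    ("RC", "this is the problem with RAID:)"),
    ("dell", "RC haha yeah"),
    ("dell", "cucho, I guess I could just get an enclosure and copy via USB..."),
    ("cucho", "dell: i would advise you to get the disk"),
]
for turn in extract_dialogue(log, "dell", "RC"):
    print(turn)
```

On this log the sketch yields three turns for dell and RC and four for 
dell and cucho, with RC's three utterances merged into one turn and 
dell's unaddressed "ok" and "lol" left out. 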




